Karl Oskar Ekvall
Fall 2020
Director: Matteo Bottai (matteo.bottai@ki.se)
Teacher: Karl Oskar Ekvall (karl.oskar.ekvall@ki.se)
Administrator: Johanna Bergman (johanna.bergman@ki.se)
Concepts and tools to understand scientific literature and perform statistical analyses.
Descriptive and exploratory statistics
Probability
Estimation and Inference
We will cover a lot of material in a short period of time. To follow:
Make good use of the lectures by asking questions
Know that you are not expected to understand everything at once
Go through slides (koekvall.github.io/biostat.html) after class and make sure everything is clear
Random variables and realizations
Understanding data using descriptive statistics and plots
Introduction to software (R and RStudio)
Data are often outcomes of random experiments or sampling.
If we repeat an experiment, we usually get different data.
Example
Five patients are selected at random to receive a new drug.
The effectiveness of the drug typically depends on who is in the treatment group.
A random variable is an (often numerical) measurement of the outcome of an experiment yet to be performed.
Example
Let \( X \) be the number of minutes before 9 am that the first student joins Zoom tomorrow.
Non-example
The number of minutes before 9 am that the first student joined Zoom today.
A realization or an observation is the particular value a random variable took when the experiment was performed.
It is common to use capital letters for random variables (\( X \)) and lower case letters for realizations (\( x \)).
Example
If tomorrow it turns out that the first student joins Zoom at 8:55 am, the realized or observed value of \( X \) is \( x = 5 \).
Example
The first student joined Zoom \( y \) minutes before 9 am today.
We typically assume data are realizations of random variables.
Example
Select 10 students in the class at random and measure their heights in centimeters.
Let \( X_i \) denote the height of the \( i \)th randomly selected person, \( i = 1, \dots, 10 \).
After having performed the experiment, our data, or sample, may consist of the realizations
\[ \{x_1, x_2, \dots, x_{10}\} = \{165, 181, \dots, 169\}. \]
Even a moderately large dataset is difficult to understand by just looking at it.
Descriptive statistics can help.
Definition: A statistic is a function of the data.
Informally: a statistic is something you can compute if you are given the data.
A descriptive statistic is a statistic intended to tell you something useful about the data.
Example
The sample mean or sample average is
\[ \bar{x} = \frac{1}{n}\sum_{i = 1}^n x_i = \frac{1}{n}(x_1 + \cdots + x_n). \]
The sample mean is a statistic because you can compute it if I tell you what \( x_1, \dots, x_n \) are.
More generally, we will consider descriptive statistics that quantify:
Central tendency / location (e.g. sample mean)
Dispersion / variability
Asymmetry
Association
In R, you can calculate the sample mean of any vector of observations easily:
heights <- c(165, 181, 177, 189, 185, 155, 170, 179, 172, 169)
mean(heights)
[1] 174.2
Median
The middle number of the sorted data if \( n \) is odd, and the average of the two middle numbers if \( n \) is even.
sort(heights)
[1] 155 165 169 170 172 177 179 181 185 189
median(heights)
[1] 174.5
According to Credit Suisse's global wealth report:
Average wealth of an adult in Sweden in 2019 was 256,000 USD.
Median was 42,000 USD.
Depending on the situation, one or the other may be the more useful measure of location.
The mean is sensitive to outliers, the median is not.
income <- c(50, 30, 60, 55, 75, 300) # 1000 USD / year
mean(income)
[1] 95
median(income)
[1] 57.5
Sample quantiles
Quartiles partition the sample into four equally sized subsets
sort(heights)
[1] 155 165 169 170 172 177 179 181 185 189
quantile(heights, c(0.25, 0.5, 0.75))
25% 50% 75%
169.25 174.50 180.50
Sample percentiles
Percentiles partition the sample into 100 equally sized subsets
quantile(heights, 0.1) # 10th percentile
10%
164
Sample variance and standard deviation
\[ s^2 = \frac{1}{n - 1}\sum_{i = 1}^n(x_i - \bar{x})^2, \quad \text{and}\quad s = \sqrt{s^2} \]
However, the standard deviation does not, in general, equal the mean absolute deviation:
\[ s \neq \frac{1}{n}\sum_{i = 1}^n \vert x_i - \bar{x}\vert. \]
Example
heights - mean(heights)
[1] -9.2 6.8 2.8 14.8 10.8 -19.2 -4.2 4.8 -2.2 -5.2
(heights - mean(heights))^2
[1] 84.64 46.24 7.84 219.04 116.64 368.64 17.64 23.04 4.84 27.04
sum((heights - mean(heights))^2) / 9 # Sample var
[1] 101.7333
sqrt(sum((heights - mean(heights))^2) / 9) # Sample sd
[1] 10.08629
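To see the distinction concretely, we can compare the sample standard deviation with the mean absolute deviation for the heights data (a quick sketch; `heights` is redefined so the snippet is self-contained):

```r
# Sample sd vs mean absolute deviation for the heights data
heights <- c(165, 181, 177, 189, 185, 155, 170, 179, 172, 169)
sd(heights)                         # 10.08629, as computed above
mean(abs(heights - mean(heights)))  # 8 -- a different number
```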
Ranges
The range is \( \max_i x_i - \min_i x_i \) and the inter-quartile range (IQR) is the difference between the third and first quartile.
max(heights) - min(heights) # Range
[1] 34
IQR(heights)
[1] 11.25
To practice, we can make the following plot:
plot(heights)
abline(h = mean(heights), lwd = 2)
abline(h = median(heights), lty = 2, lwd = 2)
abline(h = quantile(heights, 0.25), lty = 2, lwd = 2)
abline(h = quantile(heights, 0.75), lty = 2, lwd = 2)
abline(h = mean(heights) + sd(heights), lty = 3, lwd = 2)
abline(h = mean(heights) - sd(heights), lty = 3, lwd = 2)
Suppose our data is a sample of \( n \) pairs:
\[ \{(x_1, y_1), \dots, (x_n, y_n)\}. \]
As before, we can summarize properties of the \( y_i \) and \( x_i \), e.g.
\[ \bar{x} = \frac{1}{n}\sum_{i = 1}^nx_i, ~~ \bar{y} = \frac{1}{n}\sum_{i = 1}^ny_i, ~~ s^2_x = \frac{1}{n - 1}\sum_{i = 1}^n(x_i - \bar{x})^2, ~~ s_y^2 = \frac{1}{n - 1}\sum_{i = 1}^n(y_i - \bar{y})^2 \]
But how can we quantify the association between them?
Sample covariance \[ s_{xy} = \frac{1}{n - 1}\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y}). \] Intuition: Two variables have positive covariance if they tend to be larger (or smaller) than their respective means at the same time.
Notice \( s_{xx} = s_x^2 \) and \( s_{xy} = s_{yx} \).
Sample correlation \[ \rho_{xy} = \frac{s_{xy}}{s_x s_y}, \]
Example \[ \text{weights} = \{y_1, \dots, y_{10}\} = \{69, \dots, 67 \} \]
weights <- c(69, 78, 74, 90, 80, 61, 76, 77, 69, 67)
plot(heights, weights)
abline(v = mean(heights), lwd = 2)
abline(h = mean(weights), lwd = 2)
Example
The plot indicates a positive relationship between the variables
Their sample covariance is
weights - mean(weights)
[1] -5.1 3.9 -0.1 15.9 5.9 -13.1 1.9 2.9 -5.1 -7.1
heights - mean(heights)
[1] -9.2 6.8 2.8 14.8 10.8 -19.2 -4.2 4.8 -2.2 -5.2
sum((weights - mean(weights)) * (heights - mean(heights))) / 9
[1] 75.31111
Example
cov(heights, weights) / (sd(heights) * sd(weights))
[1] 0.9230557
cor(weights, heights)
[1] 0.9230557
Since we know \( -1\leq \rho_{xy} \leq 1 \), the value \( \rho_{xy} = 0.92 \) indicates a strong, positive relationship between height and weight.
Descriptive statistics are just that—descriptions of your sample.
Four different datasets, each with \( n = 11 \) observations of 2 variables
data(anscombe)
round(apply(anscombe, 2, mean), 2)
x1 x2 x3 x4 y1 y2 y3 y4
9.0 9.0 9.0 9.0 7.5 7.5 7.5 7.5
round(apply(anscombe, 2, sd), 2)
x1 x2 x3 x4 y1 y2 y3 y4
3.32 3.32 3.32 3.32 2.03 2.03 2.03 2.03
round(c(cor(anscombe$x1, anscombe$y1),
cor(anscombe$x2, anscombe$y2),
cor(anscombe$x3, anscombe$y3),
cor(anscombe$x4, anscombe$y4)), 2)
[1] 0.82 0.82 0.82 0.82
Lessons:
Imagine someone told you the sample correlation between the dose of a drug and the number of days until a patient is symptom free is -0.82.
In practice, correlation is often useful but not the whole story.
Let's consider bill length and depth in the penguins data.
It's at https://github.com/allisonhorst/palmerpenguins and artwork is by @allison_horst.
# install.packages("palmerpenguins")
library(palmerpenguins)
hist(penguins$bill_length_mm); hist(penguins$bill_depth_mm)
How do the species differ with respect to length and depth?
Is there a correlation between length and depth?
plot(penguins$bill_length_mm, penguins$bill_depth_mm)
Can you see three clusters, corresponding to species, in this plot?
plot(penguins$bill_length_mm, penguins$bill_depth_mm, col = penguins$species, pch = c(16, 17, 18)[penguins$species])
You can certainly see the clusters in this plot!
For penguins of the same species, is there a correlation between bill length and depth?
(This code uses \( \mathtt{dplyr} \); you do not need to learn it)
penguins %>% summarize(r = cor(bill_length_mm, bill_depth_mm, use = "complete"))
# A tibble: 1 x 1
r
<dbl>
1 -0.235
penguins %>% group_by(species) %>% summarize(r = cor(bill_length_mm, bill_depth_mm, use = "complete"))
# A tibble: 3 x 2
species r
<fct> <dbl>
1 Adelie 0.391
2 Chinstrap 0.654
3 Gentoo 0.643
This is known as Simpson's paradox.
str(penguins, vec.len = 1)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 ...
$ flipper_length_mm: int [1:344] 181 186 ...
$ body_mass_g : int [1:344] 3750 3800 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 ...
$ year : int [1:344] 2007 2007 ...
Averages, correlations, and similar measures are mostly meaningful for numerical variables.
Events
Rules of probabilities
Conditional probability and Bayes' theorem
To do more than descriptive and exploratory statistics, we need to consider how the data may have been generated.
Probability theory lets us use statistics to reason about what a particular sample may say about an underlying population.
Sometimes there is an actual population we are sampling from, sometimes it is a theoretical construct.
Everything we will cover can be motivated formally using mathematics.
Because this is not a course in mathematics, we will state many things without formal motivation.
Remember to ask if you find anything confusing!
Events
Things to which probabilities can be assigned are called events.
An event is a set, or collection, of (potential) outcomes.
For example:
We often use letters such as \( A \) and \( B \) to denote events.
At the end of this section, we will answer the following question:
Let \( A \) be the event that a randomly sampled person is a cannabis user.
Let \( B \) be the event that a randomly sampled person tests positive for cannabis.
Suppose that the test correctly identifies users in 90 % of cases and non-users in 95 % of cases, and that 5 % of the population are cannabis users.
What is the probability that a randomly selected person is a cannabis user given that they test positive?
We write the probability of the event \( A \) as \( P(A) \).
For example, it may be that \( P(X = 5) = 1/2 \) or \( P(\text{it rains tomorrow at 5 pm}) = 0.1 \).
We have the following:
We can define new events, for example \( C \) can be defined to be “\( A \) or \( B \)”; that is, \( C \) happens if either \( A \) happens, \( B \) happens, or both \( A \) and \( B \) happen.
We often write \( A\cup B \); you can read this as \( A \) union \( B \).
Example
Let \( A \) be the event that it rains tomorrow at 5 pm and let \( B \) be the event that I am late for class tomorrow. If \( C \) is defined as “\( A \) or \( B \)”, or \( C = A\cup B \), then \( C \) is the event that either it rains tomorrow at 5 pm, or I am late to class tomorrow, or both.
We can also define \( D \) to be “\( A \) and \( B \)”; that is, \( D \) happens if and only if both \( A \) and \( B \) happen.
We often write \( A\cap B \), which is called the intersection of \( A \) and \( B \).
Example
If \( A \) is the event that it rains tomorrow at 5 pm and \( B \) the event that I am late for class tomorrow, and if \( D \) is “\( A \) and \( B \)”, or \( D = A\cap B \), then \( D \) is the event that it both rains tomorrow at 5 pm and I am late for class. In particular, \( D \) does not happen if only one of \( A \) or \( B \) happens.
Exercise
Show (that is, use the stated facts about probabilities to argue) that \( P(C) \geq P(D) \).
The complement of \( A \) is the event “not \( A \)”.
We often write this as \( A^c \).
The probability of \( A^c \) is always \( 1 - P(A) \).
Motivation
Either \( A \) happens or it doesn't, so \( P(A\cup A^c) = 1 \).
Because \( A \) and \( A^c \) cannot happen at the same time, one of the rules of probabilities says
\[ 1 = P(A \cup A^c) = P(A) + P(A^c). \]
This is a Venn diagram. The rectangle is called the sample space. Events are subsets of the sample space.
You can think of probabilities as sizes of subsets. The “size” of the sample space is always 1.
Find \( A\cup B \), \( A\cap B \), \( A\cap C \), \( A\cup C \), \( A\cup B \cup C \), and \( (A\cap B) \cup (B\cap C) \)
By using Venn diagrams, we can convince ourselves that
\[ P(A \cup B) = P(A) + P(B) - P(A\cap B) \]
Finally, two events \( A \) and \( B \) are called independent if
\[ P(A\cap B) = P(A)P(B). \]
Example
If I roll a regular six-sided die twice, what is the probability that the first roll is 2 and the second is 4? You may assume the rolls are independent.
Answer: Let \( X_1 \) be the result of the first roll and \( X_2 \) that of the second. Then, by independence, \( P(X_1 = 2 \cap X_2 = 4) = P(X_1 = 2)P(X_2 = 4) = (1/6)(1/6) = 1/36 \).
Example
If I roll a regular six-sided die twice, what is the probability that one of the rolls is 2 and the other is 4? You may assume the rolls are independent.
Answer: Let \( X_1 \) be the result of the first roll and \( X_2 \) that of the second. First, let's figure out which outcomes are in our event. One outcome is that the first roll is 2 and the other is 4. Another is that the first is 4 and the other 2. There are no other outcomes in our event.
Thus, we are looking for
\[ P[(X_1 = 2 \cap X_2 = 4)\cup (X_1 = 4 \cap X_2 = 2)]. \]
Because the events in the union are disjoint and the rolls are independent, this is equal to
\[ P(X_1 = 2 \cap X_2 = 4) + P(X_1 = 4 \cap X_2 = 2) = 2/36. \]
Conditional probabilities are like probabilities, but with extra information.
Example
Let \( A \) be the event that a die roll is at least \( 3 \), and let \( B \) be the event that the same roll is even. What is the probability of the roll being at least 3 if we are told it is even? That is, what is the probability of \( A \) given \( B \), or
\[ P(A\mid B)? \]
Given that the roll is even, it has to be one of 2, 4, and 6.
Because every outcome is equally likely and two of those three are at least 3, intuition suggests
\[ P(A\mid B) = 2/3. \]
Intuition is usually not good at probability!
The conditional probability can be calculated as
\[ P(A\mid B) = \frac{P(A\cap B)}{P(B)}. \]
In this case, intuition was right because
\[ \frac{P(A\cap B)}{P(B)} = \frac{P(X = 4 \cup X = 6)}{P(X = 2 \cup X = 4 \cup X = 6)} = \frac{2/6}{3 / 6} = 2/3 \]
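We can verify this calculation by enumerating the six equally likely outcomes in R (a small sketch, not part of the formal derivation):

```r
# Enumerate the six equally likely die outcomes and condition on "even"
rolls   <- 1:6
b       <- rolls[rolls %% 2 == 0]  # outcomes in B (even)
a_and_b <- b[b >= 3]               # outcomes in both A and B (even and at least 3)
length(a_and_b) / length(b)        # 2/3
```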
Note: if \( P(B) = 0 \), then \( P(A\mid B) \) is not defined.
Much of applied research is concerned with conditional probabilities.
Example
If I randomly sample a patient to a study, what is the probability that they develop lung cancer given that they are a smoker?
If this conditional probability is significantly greater than the probability that they develop lung cancer given that they are not a smoker, then this can be an indication that smoking increases the risk of developing lung cancer.
Recall that \( A \) and \( B \) are independent if \( P(A\cap B) = P(A)P(B) \).
Thus, if \( A \) and \( B \) are independent,
\[ P(A\mid B) = \frac{P(A\cap B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A). \]
Knowing \( B \) gives you no information about how likely \( A \) is to occur.
The probability that \( A \) happens is the probability that \( A \) and \( B \) happen, or that \( A \) and \( B^c \) happen.
Draw a Venn diagram to convince yourself that
\[ A = (A\cap B)\cup (A\cap B^c) \]
Because \( A\cap B \) and \( A\cap B^c \) are disjoint,
\[ P(A) = P(A\cap B) + P(A\cap B^c) = P(A\mid B)P(B) + P(A\mid B^c)P(B^c). \]
\[ P(A\mid B) = \frac{P(B\mid A) P(A)}{P(B)} \]
Example (from https://en.wikipedia.org/wiki/Bayes%27_theorem)
Let \( A \) be the event that a randomly sampled person is a cannabis user.
Let \( B \) be the event that the randomly sampled person tests positive for cannabis.
What is the probability that a randomly sampled person who tests positive for cannabis is a cannabis user?
Make the following assumptions:
The test's true positive rate is 0.9, or \( P(B \mid A) = 0.9 \).
The test's true negative rate is 0.95, or \( P(B^c \mid A^c) = 0.95 \).
In the population, 5 % are cannabis users, so \( P(A) = 0.05 \).
We want to compute \( P(A\mid B) \), and Bayes' theorem tells us
\[ P(A\mid B) = \frac{P(B\mid A) P(A)}{P(B)} = \frac{0.9\times 0.05}{ P(B)}, \]
The law of total probability tells us \( P(B) = P(B\mid A)P(A) + P(B\mid A^c)P(A^c) \).
We know \( P(B\mid A)P(A) = 0.9 \times 0.05 \) and \( P(A^c) = 1 - P(A) = 0.95 \).
We can compute \( P(B\mid A^c) = 1 - P(B^c \mid A^c) = 0.05 \).
We get
\[ P(A\mid B) = \frac{0.9\times 0.05}{0.9\times 0.05 + 0.05 \times 0.95} = \frac{90}{90 + 95} \approx 0.49. \]
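The same calculation is easy to reproduce in R (a sketch; the variable names are my own):

```r
# Bayes' theorem for the cannabis test example
p_user <- 0.05   # P(A), prevalence of cannabis use
tpr    <- 0.90   # P(B | A), true positive rate
tnr    <- 0.95   # P(B^c | A^c), true negative rate
p_pos  <- tpr * p_user + (1 - tnr) * (1 - p_user)  # P(B), law of total probability
tpr * p_user / p_pos                               # P(A | B), about 0.49
```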
Even after having observed a positive test, it is more likely the person is not a user.
Intuition is often wrong about Bayes' theorem because intuition usually suggests that if you use an accurate test, then you can trust its result.
\( P(A\mid B) = P(B\mid A)P(A) / P(B) \) (Bayes' theorem; follows from the definition of conditional probability)
Remember to draw a Venn diagram if you are unsure!
Exercise
Is it true, for any events \( A \) and \( B \), that \( P(A) \geq P(A\cap B) \) and \( P(A) \leq P(A \cup B) \)?
Why?
Can the inequalities be equalities for some specific choices of \( A \) and \( B \)?
Discrete random variables
Continuous random variables
Distributions
Most events we will calculate probabilities for involve discrete or continuous random variables.
Recall that a random variable is a, typically numerical, measurement of the outcome of an experiment yet to be performed.
Discrete random variables
Discrete variables can take at most countably many values (countable support).
“Countably” has a mathematical definition, but it is quite literal: you can count the possible values.
They can be finitely or infinitely many.
Example
The set \( \{1, 1/2, 1/3, 1/4, \dots\} \) is countable and infinite.
Example
Suppose I flip a coin and if it comes up heads, I flip again. If it comes up tails, I stop.
Let \( X \) be the number of flips I will have made at the end of this experiment.
It is possible that I flip \( 1000 \) heads in a row, but highly improbable.
The same is true for any integer. Thus, \( X \) can take the values \( 1, 2, 3, \dots \) and is therefore a discrete random variable that can take infinitely many values.
Fun fact
With advanced probability theory one can show that the probability of continuing flipping forever is zero.
Continuous random variables
Continuous variables can take an uncountable number of values (uncountable support). That is, you cannot count the possible values even if you keep counting forever.
In practice, we only deal with discrete variables because we cannot measure or store values with infinite precision.
Nevertheless, it is often a useful approximation to assume continuous variables.
Example
The number of (decimal) numbers between \( 0 \) and \( 1 \).
The time it takes for my daughter to tie her shoes in the morning.
Distribution
The rule (law) telling us the probabilities that \( X \) takes certain values is called the distribution of \( X \).
Every random variable \( X \) has a cumulative distribution function (cdf).
Cumulative distribution function
The cdf of a random variable \( X \) is the function defined by \( F(x) = P(X \leq x) \).
You plug in \( x \), the cdf tells you the probability that \( X \) is less than or equal to \( x \).
It may not be obvious, but the cdf tells you everything there is to know about the distribution of \( X \).
We say that \( F \) characterizes the distribution of \( X \).
In theory, if you know \( F \), you can calculate any probabilities involving \( X \).
Discrete probability distributions also have a probability mass function (pmf).
Probability mass function
The pmf of a discrete \( X \) is the function defined by \( f(x) = P(X = x) \).
You plug in \( x \), the pmf tells you the probability that \( X = x \).
Example
We say that \( X \) has a Bernoulli distribution with parameter \( 0 \leq p \leq 1 \), or \( X\sim \mathrm{Ber}(p) \), if
\[
P(X = 1) = p, \qquad P(X = 0) = 1 - p.
\]
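In R there is no separate Bernoulli function; a \( \mathrm{Ber}(p) \) draw is a binomial draw with `size = 1` (a small simulation sketch, with \( p = 0.3 \) chosen for illustration):

```r
# Simulate 1000 realizations of X ~ Ber(0.3)
set.seed(1)
x <- rbinom(1000, size = 1, prob = 0.3)
mean(x)  # should be close to p = 0.3
```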
If \( X \) is the number of successes in \( n \) independent trials, each with success probability \( p \), then \( X \) has a binomial distribution with parameters \( n \) and \( p \), or \( X\sim \mathrm{Bin}(n, p) \).
Example
Suppose I flip \( n = 10 \) coins and let \( X \) be the number of heads I get, then \( X \) has a binomial distribution with parameters \( n = 10 \) and \( p = 1/2 \) (assuming the coin is fair).
The pmf of \( X \) tells us the probability that \( X = x \) for every possible \( x \).
The binomial has pmf \[ f(x) = \binom{n}{x} p^x(1 - p)^{n - x}, \quad x = 0, \dots, n. \] The binomial coefficient is the number of ways to select \( x \) from \( n \).
Luckily, R can compute this for us.
Example
Suppose I flip \( 10 \) coins and let \( X_i \) be one if the \( i \)th flip is heads, and 0 otherwise. Then \( \sum_{i = 1}^{10} X_i \) is the number of heads I get in \( 10 \) flips. This illustrates the following fact:
If \( X_1, \dots, X_n \) are independent \( \mathrm{Ber}(p) \), then \( X = \sum_{i = 1}^n X_i \sim \mathrm{Bin}(n, p) \).
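A quick simulation is one way to convince yourself of this fact (a sketch; the tolerance in the comparison reflects Monte Carlo error):

```r
# Sums of 10 independent Ber(0.5) draws should follow Bin(10, 0.5)
set.seed(2)
sums <- replicate(10000, sum(rbinom(10, size = 1, prob = 0.5)))
mean(sums == 4)                   # empirical P(X = 4)
dbinom(4, size = 10, prob = 0.5)  # theoretical P(X = 4), about 0.205
```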
Example
Suppose 50 % of all penguins in the population are of the Adelie species.
Assuming penguins were sampled independently, what was the probability of obtaining the number of Adelie penguins in the penguins data?
table(penguins$species)
Adelie Chinstrap Gentoo
152 68 124
# The probability of 'x' successes in 'size' independent trials with success probability 'prob'
dbinom(x = 152, size = 152 + 68 + 124, prob = 0.5) # d for density
[1] 0.00420746
Example
What was the probability of obtaining at least as many Adelie penguins as in the penguins data?
# Use the cdf
# 1 - P(X <= n_success - 1) = P(X > n_success - 1) = P(X >= n_success)
1 - pbinom(q = 152 - 1, size = 152 + 68 + 124, prob = 0.5)
[1] 0.9865395
Exercise
What was the probability of obtaining fewer Adelie penguins than in the penguins data?
Exercise
Consider rolling three dice and let \( X \) be the number of 6s rolled. What is the distribution of \( X \)?
Exercise
Consider rolling two dice and let \( X \) be the number of 1s rolled. Find the probability mass function for \( X \).
Answer: First note that \( X \) can take three values: \( 0, 1 \), or \( 2 \). Thus, we need to find \( f(x) = P(X = x) \) for \( x = 0, 1, 2 \). Let \( X_i \) be one if the \( i \)th roll is 1 and zero otherwise. The event that \( X = 0 \) is the same as \( X_1 = 0 \cap X_2 = 0 \). Assuming the rolls are independent, one of the rules of probabilities says
\[ P(X = 0) = P(X_1 = 0 \cap X_2 = 0) = P(X_1 = 0)P(X_2 = 0)= (5/6)(5/6) = 25/36. \]
The event \( X = 1 \) consists of the outcomes \( X_1 = 0 \cap X_2 = 1 \) and \( X_1 = 1 \cap X_2 = 0 \). That is,
\[ (X = 1) = (X_1 = 0 \cap X_2 = 1) \cup (X_1 = 1 \cap X_2 = 0) \]
The events in the union are disjoint, so the rules of probabilities say
\[ P(X = 1) = P(X_1 = 0 \cap X_2 = 1) + P(X_1 = 1 \cap X_2 = 0), \]
which, assuming independence, is
\[ P(X_1 = 0)P(X_2 = 1) + P(X_1 = 1)P(X_2 = 0) = (5/6)(1/6) + (1/6)(5/6) = 10/36. \]
It remains to find \( P(X = 2) \). We could do a similar calculation, or note that
\[ (X = 2)^c = (X = 0) \cup (X = 1) \]
so
\[ P(X = 2) = 1- P(X \neq 2) = 1 - (25/36 + 10/36) = 1/36. \]
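We can check this pmf by enumerating all 36 equally likely outcomes of the two rolls in R:

```r
# All 36 outcomes of two die rolls; count the number of 1s in each
rolls <- expand.grid(first = 1:6, second = 1:6)
ones  <- rowSums(rolls == 1)
table(ones) / 36  # 25/36, 10/36, 1/36
```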
One of the most commonly used distributions in practice is the Poisson distribution.
Poisson distribution
A random variable \( X \) has a Poisson distribution with parameter \( \lambda > 0 \) if it has pmf
\[ f(x) = e^{-\lambda} \lambda^x / x!, \quad x! = x (x - 1)(x - 2)\cdots 2 \cdot 1, \quad x = 0, 1, \dots \]
(with the convention \( 0! = 1 \)).
x_vals <- 0:10; lambda <- 4; pmf <- dpois(x_vals, lambda)
plot(x_vals, pmf , type = "h"); points(x_vals, pmf)
For a discrete \( X \), its mean (expected value) and variance are
\[ \mu = E(X) = \sum_{x}x f(x) = \sum_{x} x P(X = x) \]
\[ \sigma^2 = \mathrm{var}(X) = \sum_x(x - \mu)^2 f(x) = \sum_x(x - \mu)^2 P(X = x) \geq 0 \]
The sums are over all \( x \) such that \( P(X = x) > 0 \).
These are average or expected values of \( X \) and \( (X - \mu)^2 \).
The standard deviation of \( X \) is \( \sqrt{\mathrm{var}(X)} \).
Intuition
If \( X \) has large mean, then if we observe many independent realizations they will be large on average.
If \( X \) has large variance, then if we observe many independent realizations they will be very different.
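As a concrete example, the mean and variance of a fair die roll can be computed directly from the definitions above:

```r
# Mean and variance of a fair die roll, computed from its pmf
x  <- 1:6
f  <- rep(1/6, 6)              # pmf: each face has probability 1/6
mu <- sum(x * f)               # 3.5
sigma2 <- sum((x - mu)^2 * f)  # 35/12, about 2.92
```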
Exercise [*]
Show (or remember) that
Example
The mean and variance of \( X\sim \mathrm{Ber}(p) \) is
\[ \mu = 1 \times P(X = 1) + 0 \times P(X = 0) = 1 \times p + 0 \times (1 - p) = p \]
\[ \sigma^2 = (1 - p)^2 \times P(X = 1) + (0 - p)^2 \times P(X = 0) = (1 - p)^2 p + p^2(1 - p) = p - p^2. \]
One can show that if \( X \) is binomial with parameters \( n \) and \( p \) and \( Y \) is Poisson with parameter \( \lambda \), then
\[ E(X) = np,\quad \mathrm{var}(X) = n(p - p^2) \]
\[ E(Y) = \lambda,\quad \mathrm{var}(Y) = \lambda. \]
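We can check the Poisson formulas numerically by summing over the pmf, truncating the infinite sum where the tail probabilities are negligible:

```r
# E(Y) and var(Y) for Y ~ Poisson(4), computed from the pmf
lambda <- 4
y <- 0:100               # truncation; P(Y > 100) is negligible here
p <- dpois(y, lambda)
sum(y * p)               # approximately lambda = 4
sum((y - lambda)^2 * p)  # approximately lambda = 4
```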
Continuous variables have uncountable support.
The distribution of continuous random variables are characterized by a probability density function (pdf) \( f(x) \).
The area under the graph of a pdf tells you the probability of \( X \) taking its value in that region.
In general,
\[ P(a < X \leq b) = \int_a^b f(x)\, \mathrm{d} x = F(b) - F(a). \]
You will not have to integrate anything analytically in this class—we will use R!
In R, you can evaluate cdfs for common distributions.
pnorm(0) - pnorm(-2) # The area in the previous slide
[1] 0.4772499
You should know:
The total area under a pdf is \( 1 \) (probability that \( X \) is between \( -\infty \) and \( \infty \))
The probability that \( X \) is between \( a \) and \( b \) is the probability that \( X \) is less than \( b \) minus the probability that \( X \) is less than \( a \), so we can compute it as \( P(a < X < b) = F(b) - F(a) \).
Exercise
Use basic rules of probabilities to explain why the second point is true.
Why don't we characterize continuous distributions by a pmf \( f(x) = P(X = x) \)?
It is outside the scope of this class to prove, but you should know that, for a continuous \( X \),
\[ P(X = x) = 0 \quad \text{for every $x$!} \]
A continuous \( X \) has mean and variance
\[ \mu = E(X) = \int_{-\infty}^{\infty} x f(x)\, \mathrm{d}x \] and \[ \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, \mathrm{d}x. \]
Common continuous distributions
Normal with mean \( \mu \) and variance \( \sigma^2 \): \( f(x; \mu, \sigma^2) = e^{-(x - \mu)^2 / (2\sigma^2)} / \sqrt{2\pi \sigma^2} \)
Uniform on \( [a, b] \): \( f(x) = 1 / (b - a) \) for \( a \leq x\leq b \)
Exponential with parameter \( \lambda > 0 \): \( f(x) = \lambda e^{-\lambda x} \) for \( x \geq 0 \)
The probability that they are less than 0.9:
pnorm(0.9)
[1] 0.8159399
pexp(0.9)
[1] 0.5934303
punif(0.9)
[1] 0.9
Exercise
What is the probability that they are greater than \( 0.8 \)? How to calculate it in R?
One can show that if \( X \) is exponential and \( Y \) uniform on \( [a, b] \), then
\[ E(X) = 1 / \lambda,\quad \mathrm{var}(X) = 1 / \lambda^2 \]
\[ E(Y) = (a + b)/2,\quad \mathrm{var}(Y) = (b - a)^2 /12. \]
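A quick Monte Carlo check of these formulas (a sketch; \( \lambda = 2 \) and \( [a, b] = [0, 1] \) are my choices of parameters):

```r
# Empirical moments should be close to the theoretical formulas
set.seed(3)
x <- rexp(1e5, rate = 2)           # exponential with lambda = 2
y <- runif(1e5, min = 0, max = 1)  # uniform on [0, 1]
c(mean(x), var(x))  # near 1/2 and 1/4
c(mean(y), var(y))  # near 1/2 and 1/12
```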
Recall that the histogram tells you how many observations fall in a certain interval.
hist(penguins$flipper_length_mm)
If you make the intervals smaller and the sample larger, eventually the histogram should look like the pdf of a randomly selected penguin's flipper length.
x <- rnorm(5); y <- rnorm(10); z <- rnorm(50); w <- rnorm(5000)
Thus, if the sample is large enough, we should be able to say something about penguins in general, not just the ones in our sample; that is the goal of statistical inference!